Final Project - Chicago Crime Rate Prediction¶

Haobing Liu¶

Crime prediction is a complex problem because there are many random human factors involved. This report is about retracing historical crime events in Chicago and creating a model that attempts to make predictions about the number of crimes in the community. The predictions are based on multiple open data sources that were combined before. The primary data source is crime incident data from the Chicago Police Department for the period 2015-2019. In terms of features, I currently include socioeconomic features, temporal features (e.g., hour, day and month), and nearest neighbor features. The prediction model is a prediction of the crime rate for each community.

Part 1 - Data Wrangling¶

Before diving into prediction, I first need to collect and clean the dataset, and then preprocess the data to ensure that it is in a usable format.

In [1]:
import folium
import altair as alt
import numpy as np
import pandas as pd
import geopandas as gpd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import hvplot.pandas
import esri2gpd
import carto2gpd
import seaborn as sns
import osmnx as ox
import datetime
from IPython.display import IFrame
from matplotlib import pyplot as plt
from shapely.geometry import Point

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# perform the K-Means fit
from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Pipelines
from sklearn.pipeline import make_pipeline

# Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder

# Neighbors
from sklearn.neighbors import NearestNeighbors
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/_compat.py:111: UserWarning: The Shapely GEOS version (3.11.0-CAPI-1.17.0) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow.
  warnings.warn(

1.1 Crime data¶

Here, I collect crime data from 2015 to 2019. Some data from recent years were not selected because key features were missing. The dataset is in a tabular form and includes chronological, geographical and text data, which extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The dataset contains more than 1,000,000 records/rows of data.

In [2]:
# import 5-year crime data from chicago data portal
pd.set_option('display.max_columns', None)
url = 'https://data.cityofchicago.org/resource/kf95-mnd6.json?$limit=999999'
crime2016 = pd.read_json(url)
In [3]:
url = 'https://data.cityofchicago.org/resource/d62x-nvdr.json?$limit=999999'
crime2017 = pd.read_json(url)
In [4]:
url = 'https://data.cityofchicago.org/resource/3i3m-jwuy.json?$limit=999999'
crime2018 = pd.read_json(url)
In [5]:
url = 'https://data.cityofchicago.org/resource/w98m-zvie.json?$limit=999999'
crime2019 = pd.read_json(url)
In [6]:
url = 'https://data.cityofchicago.org/resource/vwwp-7yr9.json?$limit=999999'
crime2015 = pd.read_json(url)
In [7]:
frames = [crime2015, crime2016, crime2017, crime2018, crime2019]

# combine each year together and remove Na
allcrime = pd.concat(frames).dropna()
In [8]:
cols = ["case_number",
        "date",
        "location",
        "district",
        "block",
        "primary_type",
        "description",
        "location_description",
        "community_area",
        "latitude",
        "longitude",
        "ward",
        "year",
        "domestic",
        "beat",
        "arrest"
]

# Trim to these columns
allcrime = allcrime[cols]
In [9]:
# extract the year from date
allcrime["date"] = pd.to_datetime(allcrime["date"])
allcrime['year'] = pd.DatetimeIndex(allcrime['date']).year
allcrime['month'] = pd.DatetimeIndex(allcrime['date']).month
allcrime['hour'] = pd.DatetimeIndex(allcrime['date']).hour
allcrime['day'] = pd.DatetimeIndex(allcrime['date']).day
allcrime["dow"] = allcrime["date"].apply(lambda x: x.strftime('%A'))
allcrime['month'] = allcrime['month'].astype(str)
allcrime['year'] = allcrime['year'].astype(str)
allcrime["year_month"] = allcrime['year'].str.cat(allcrime['month'].str.zfill(2), sep="-")

# change data type
allcrime['community_area'] = allcrime['community_area'].astype(int)
allcrime['district'] = allcrime['district'].astype(int)
allcrime['ward'] = allcrime['ward'].astype(int)
allcrime['beat'] = allcrime['beat'].astype(int)
In [10]:
# creating a geometry column 
geometry = [Point(xy) for xy in zip(allcrime['longitude'], allcrime['latitude'])]

# coordinate reference system : WGS84
crs = {'init': 'epsg:4326'}

# creating a Geographic data frame 
allcrime = gpd.GeoDataFrame(allcrime, crs=crs, geometry=geometry)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/pyproj/crs/crs.py:130: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  in_crs_string = _prepare_from_proj_string(in_crs_string)
In [11]:
# add the count field
allcrime["count"] = 1

# data format transform
allcrime['primary_type'] = pd.Categorical(allcrime['primary_type'])
allcrime['description'] = pd.Categorical(allcrime['description'])
allcrime['location_description'] = pd.Categorical(allcrime['location_description'])
In [12]:
allcrime.head()
Out[12]:
case_number date location district block primary_type description location_description community_area latitude longitude ward year domestic beat arrest month hour day dow year_month geometry count
0 HZ100370 2015-12-31 23:59:00 {'latitude': '41.757366519', 'human_address': ... 6 075XX S EMERALD AVE CRIMINAL DAMAGE TO VEHICLE STREET 68 41.757367 -87.642993 17 2015 False 621 False 12 23 31 Thursday 2015-12 POINT (-87.64299 41.75737) 1
2 HZ100006 2015-12-31 23:55:00 {'latitude': '41.751270452', 'human_address': ... 4 079XX S STONY ISLAND AVE BATTERY AGGRAVATED: OTHER DANG WEAPON STREET 45 41.751270 -87.585822 8 2015 False 411 False 12 23 31 Thursday 2015-12 POINT (-87.58582 41.75127) 1
3 HZ100010 2015-12-31 23:50:00 {'latitude': '42.016804165', 'human_address': ... 24 024XX W FARGO AVE THEFT $500 AND UNDER APARTMENT 2 42.016804 -87.690709 50 2015 False 2411 False 12 23 31 Thursday 2015-12 POINT (-87.69071 42.01680) 1
4 HZ100002 2015-12-31 23:50:00 {'latitude': '41.949837364', 'human_address': ... 19 037XX N CLARK ST BATTERY SIMPLE SIDEWALK 6 41.949837 -87.658635 44 2015 False 1923 True 12 23 31 Thursday 2015-12 POINT (-87.65864 41.94984) 1
5 HZ102701 2015-12-31 23:45:00 {'latitude': '41.910469677', 'human_address': ... 25 050XX W CONCORD PL CRIMINAL DAMAGE TO PROPERTY APARTMENT 25 41.910470 -87.751597 37 2015 False 2533 False 12 23 31 Thursday 2015-12 POINT (-87.75160 41.91047) 1

In this project, the crime dataset from 2015 to 2019 owns 1303648 rows and 23 columns, and also includes 34 types of crimes, which are multiple categorical values.

In [13]:
print('Dataset shape ->', allcrime.shape)
Dataset shape -> (1303648, 23)
In [14]:
crime = allcrime["primary_type"].unique()
list(crime)
Out[14]:
['CRIMINAL DAMAGE',
 'BATTERY',
 'THEFT',
 'OTHER OFFENSE',
 'ROBBERY',
 'MOTOR VEHICLE THEFT',
 'WEAPONS VIOLATION',
 'CRIMINAL SEXUAL ASSAULT',
 'ASSAULT',
 'BURGLARY',
 'DECEPTIVE PRACTICE',
 'CRIM SEXUAL ASSAULT',
 'NARCOTICS',
 'CRIMINAL TRESPASS',
 'OFFENSE INVOLVING CHILDREN',
 'INTERFERENCE WITH PUBLIC OFFICER',
 'SEX OFFENSE',
 'ARSON',
 'PUBLIC PEACE VIOLATION',
 'PROSTITUTION',
 'LIQUOR LAW VIOLATION',
 'HOMICIDE',
 'INTIMIDATION',
 'KIDNAPPING',
 'OBSCENITY',
 'STALKING',
 'PUBLIC INDECENCY',
 'NON-CRIMINAL',
 'NON - CRIMINAL',
 'CONCEALED CARRY LICENSE VIOLATION',
 'GAMBLING',
 'HUMAN TRAFFICKING',
 'OTHER NARCOTIC VIOLATION',
 'NON-CRIMINAL (SUBJECT SPECIFIED)']

1.2 Community data¶

The community dataset is from Chicago Data Portal. There are 77 communities in the Chicago City.

In [15]:
boundary = gpd.read_file('Boundaries - Community Areas (current).geojson')
boundary.head()
Out[15]:
community area shape_area perimeter area_num_1 area_numbe comarea_id comarea shape_len geometry
0 DOUGLAS 0 46004621.1581 0 35 35 0 0 31027.0545098 MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ...
1 OAKLAND 0 16913961.0408 0 36 36 0 0 19565.5061533 MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ...
2 FULLER PARK 0 19916704.8692 0 37 37 0 0 25339.0897503 MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ...
3 GRAND BOULEVARD 0 48492503.1554 0 38 38 0 0 28196.8371573 MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ...
4 KENWOOD 0 29071741.9283 0 39 39 0 0 23325.1679062 MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ...
In [16]:
community = boundary[["community", "area_numbe", "geometry"]].rename(columns={'area_numbe': 'community_area'}).drop(columns='geometry') 
community['community_area'] = community['community_area'].astype(int)
community.head()
Out[16]:
community community_area
0 DOUGLAS 35
1 OAKLAND 36
2 FULLER PARK 37
3 GRAND BOULEVARD 38
4 KENWOOD 39
In [17]:
allcrime = allcrime.merge(community, on="community_area")
allcrime.head()
Out[17]:
case_number date location district block primary_type description location_description community_area latitude longitude ward year domestic beat arrest month hour day dow year_month geometry count community
0 HZ100370 2015-12-31 23:59:00 {'latitude': '41.757366519', 'human_address': ... 6 075XX S EMERALD AVE CRIMINAL DAMAGE TO VEHICLE STREET 68 41.757367 -87.642993 17 2015 False 621 False 12 23 31 Thursday 2015-12 POINT (-87.64299 41.75737) 1 ENGLEWOOD
1 HZ100501 2015-12-31 22:00:00 {'latitude': '41.766639191', 'human_address': ... 7 070XX S PARNELL AVE CRIMINAL DAMAGE TO VEHICLE STREET 68 41.766639 -87.638388 6 2015 False 732 False 12 22 31 Thursday 2015-12 POINT (-87.63839 41.76664) 1 ENGLEWOOD
2 HZ100572 2015-12-31 21:00:00 {'latitude': '41.759381999', 'human_address': ... 7 074XX S GREEN ST MOTOR VEHICLE THEFT TRUCK, BUS, MOTOR HOME STREET 68 41.759382 -87.645475 17 2015 False 733 False 12 21 31 Thursday 2015-12 POINT (-87.64547 41.75938) 1 ENGLEWOOD
3 HY556424 2015-12-31 17:55:00 {'latitude': '41.77529175', 'human_address': '... 7 005XX W 65TH PL ROBBERY ARMED: HANDGUN ALLEY 68 41.775292 -87.637448 20 2015 False 722 False 12 17 31 Thursday 2015-12 POINT (-87.63745 41.77529) 1 ENGLEWOOD
4 HY556343 2015-12-31 16:49:00 {'latitude': '41.76228091', 'human_address': '... 7 072XX S ABERDEEN ST BATTERY AGGRAVATED: HANDGUN DRIVEWAY - RESIDENTIAL 68 41.762281 -87.651616 6 2015 False 733 False 12 16 31 Thursday 2015-12 POINT (-87.65162 41.76228) 1 ENGLEWOOD

1.3 Census Data¶

This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” by Chicago community area, for the years 2008 – 2012.

The indicators are the percent of occupied housing units with more than one person per room (i.e., crowded housing); the percent of households living below the federal poverty level; the percent of persons in the labor force over the age of 16 years that are unemployed; the percent of persons over the age of 25 years without a high school diploma; the percent of the population under 18 or over 64 years of age (i.e., dependency); and per capita income.

In [18]:
url = "https://data.cityofchicago.org/resource/kn9c-c2s2.csv"
census = pd.read_csv(url)
census = census.rename(columns={'ca': 'community_area'})
In [19]:
census['community_area'] = census['community_area'].astype("Int64")
allcrime['community_area'] = allcrime['community_area'].astype("Int64")
allcrime = pd.merge(allcrime, census,on="community_area")
allcrime.head()
Out[19]:
case_number date location district block primary_type description location_description community_area latitude longitude ward year domestic beat arrest month hour day dow year_month geometry count community community_area_name percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index
0 HZ100370 2015-12-31 23:59:00 {'latitude': '41.757366519', 'human_address': ... 6 075XX S EMERALD AVE CRIMINAL DAMAGE TO VEHICLE STREET 68 41.757367 -87.642993 17 2015 False 621 False 12 23 31 Thursday 2015-12 POINT (-87.64299 41.75737) 1 ENGLEWOOD Englewood 3.8 46.6 28.0 28.5 42.5 11888 94.0
1 HZ100501 2015-12-31 22:00:00 {'latitude': '41.766639191', 'human_address': ... 7 070XX S PARNELL AVE CRIMINAL DAMAGE TO VEHICLE STREET 68 41.766639 -87.638388 6 2015 False 732 False 12 22 31 Thursday 2015-12 POINT (-87.63839 41.76664) 1 ENGLEWOOD Englewood 3.8 46.6 28.0 28.5 42.5 11888 94.0
2 HZ100572 2015-12-31 21:00:00 {'latitude': '41.759381999', 'human_address': ... 7 074XX S GREEN ST MOTOR VEHICLE THEFT TRUCK, BUS, MOTOR HOME STREET 68 41.759382 -87.645475 17 2015 False 733 False 12 21 31 Thursday 2015-12 POINT (-87.64547 41.75938) 1 ENGLEWOOD Englewood 3.8 46.6 28.0 28.5 42.5 11888 94.0
3 HY556424 2015-12-31 17:55:00 {'latitude': '41.77529175', 'human_address': '... 7 005XX W 65TH PL ROBBERY ARMED: HANDGUN ALLEY 68 41.775292 -87.637448 20 2015 False 722 False 12 17 31 Thursday 2015-12 POINT (-87.63745 41.77529) 1 ENGLEWOOD Englewood 3.8 46.6 28.0 28.5 42.5 11888 94.0
4 HY556343 2015-12-31 16:49:00 {'latitude': '41.76228091', 'human_address': '... 7 072XX S ABERDEEN ST BATTERY AGGRAVATED: HANDGUN DRIVEWAY - RESIDENTIAL 68 41.762281 -87.651616 6 2015 False 733 False 12 16 31 Thursday 2015-12 POINT (-87.65162 41.76228) 1 ENGLEWOOD Englewood 3.8 46.6 28.0 28.5 42.5 11888 94.0

1.4 Merge Data¶

Finally, the data will be merged together. The above datasets are added to the original dataset.

In [20]:
# group by community
count=allcrime.groupby(["community"])["count"].sum()

# add arrest rate
arrest=allcrime.groupby(['community'])['arrest'].mean()

# add arrest domestic violence rate
domestic=allcrime.groupby(['community'])['domestic'].mean()

#types = allcrime.groupby(["community","primary_type"])["count"].sum().reset_index()
#types = pd.DataFrame(types).rename(columns={'count': 'crime_count'})

# transform dataframe from long to wide, add month column
month = allcrime.groupby(["community","month"])["count"].sum().reset_index()
month = pd.DataFrame(month).rename(columns={'count': 'month_count'})

days = allcrime.groupby(["community","dow"])["count"].sum().reset_index()
days = pd.DataFrame(days).rename(columns={'count': 'week_count'})

weapon_select=allcrime.groupby(["description","community"])['count'].sum().reset_index()
weapon_select=pd.DataFrame(weapon_select)
knife=weapon_select[weapon_select["description"].str.contains('KNIFE')]
gun=weapon_select[weapon_select["description"].str.contains('GUN')]
weapon=weapon_select[weapon_select["description"].str.contains('WEAPON')]
withweapon=pd.concat([knife, gun, weapon]).rename(columns={'count': 'weapon_count'})
withweapon=withweapon.groupby(["community"])['weapon_count'].sum().reset_index()
In [21]:
# merge together and get final cluster
crime=pd.merge(count,arrest,on="community")
crime=pd.merge(crime,domestic,on="community")
#crime=pd.merge(crime,types,on="community")
crime=pd.merge(crime,month,on="community")
crime=pd.merge(crime,days,on="community")
crime=pd.merge(crime,withweapon,on="community")
crime=crime.dropna()
crime
Out[21]:
community count arrest domestic month month_count dow week_count weapon_count
0 ALBANY PARK 11908 0.130752 0.141250 1 934 Friday 1634 1116
1 ALBANY PARK 11908 0.130752 0.141250 1 934 Monday 1756 1116
2 ALBANY PARK 11908 0.130752 0.141250 1 934 Saturday 1673 1116
3 ALBANY PARK 11908 0.130752 0.141250 1 934 Sunday 1708 1116
4 ALBANY PARK 11908 0.130752 0.141250 1 934 Thursday 1704 1116
... ... ... ... ... ... ... ... ... ...
6463 WOODLAWN 18007 0.218137 0.222691 9 1493 Saturday 2455 2206
6464 WOODLAWN 18007 0.218137 0.222691 9 1493 Sunday 2468 2206
6465 WOODLAWN 18007 0.218137 0.222691 9 1493 Thursday 2633 2206
6466 WOODLAWN 18007 0.218137 0.222691 9 1493 Tuesday 2651 2206
6467 WOODLAWN 18007 0.218137 0.222691 9 1493 Wednesday 2553 2206

6468 rows × 9 columns

In [22]:
crime_geo=boundary.merge(crime,on="community")
crime_geo=crime_geo.drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len']).rename(columns={'area_numbe': 'community_area'})
crime_geo=crime_geo.to_crs(3857)
In [23]:
crime_geo['community_area'] = crime_geo['community_area'].astype("Int64")
crime_geo=census.merge(crime_geo,on="community_area")
crime_geo
Out[23]:
community_area community_area_name percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index community geometry count arrest domestic month month_count dow week_count weapon_count
0 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Friday 2583 1309
1 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Monday 2744 1309
2 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Saturday 2580 1309
3 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Sunday 2549 1309
4 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Thursday 2522 1309
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6463 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Saturday 1739 682
6464 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Sunday 1591 682
6465 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Thursday 1738 682
6466 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Tuesday 1756 682
6467 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Wednesday 1786 682

6468 rows × 19 columns

In [24]:
crime_geo=gpd.GeoDataFrame(crime_geo)
crime_geo=crime_geo.to_crs(3857)
crime_geo
Out[24]:
community_area community_area_name percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index community geometry count arrest domestic month month_count dow week_count weapon_count
0 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Friday 2583 1309
1 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Monday 2744 1309
2 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Saturday 2580 1309
3 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Sunday 2549 1309
4 1 Rogers Park 7.7 23.6 8.7 18.2 27.5 23939 39.0 ROGERS PARK MULTIPOLYGON (((-9757660.529 5160704.746, -975... 18144 0.171627 0.140267 1 1378 Thursday 2522 1309
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6463 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Saturday 1739 682
6464 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Sunday 1591 682
6465 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Thursday 1738 682
6466 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Tuesday 1756 682
6467 77 Edgewater 4.1 18.2 9.2 9.7 23.8 33385 19.0 EDGEWATER MULTIPOLYGON (((-9757660.529 5160704.746, -975... 12266 0.159873 0.104598 9 1081 Wednesday 1786 682

6468 rows × 19 columns

Part 2 - Exploratory Analysis¶

In this section, I will explore the raw data. Based on crime records from previous years, this section is designed to show the basic information about crime in Chicago, such as the type of crime (e.g., wound, fatality etc.), the number of different types of crime, the location of crime, and the time series when the crime occurred. I will create some interactive charts by using altair, seaborn, numpy,plotly, andfolium function. The main point in this part is to indicate how significant location and time have effects on random crime.

2.1 Crimes Choropleth Map¶

In [25]:
# define basemap in folium
import folium
from folium.plugins import HeatMap


#define 
def get_neighborhood_style(feature):
    """Return a style dict."""
    return {"color": "#556158",
            "weight": 1.2}


def get_highlighted_style(feature):
    """Return a style dict when highlighting a feature."""
    return {"weight": 3, "color": "orange"}

def get_style(feature):
    """
    Given an input GeoJSON feature, return a style dict.
    
    Notes
    -----
    The color in the style dict is determined by the 
    "index_normalized" column in the input "feature".
    """
    # Get the data value from the feature
    value = feature['properties']['index_normalized']
    
    # Evaluate the color map
    # NOTE: value must between 0 and 1
    rgb_color = cmap(value) # this is an RGB tuple
    
    # Convert to hex string
    color = mcolors.rgb2hex(rgb_color)
    
    # Return the style dictionary
    return {'weight': 3, 
            'color': color, 
            'fillColor': color, 
            "fillOpacity": 1}
In [26]:
community_count = allcrime.groupby(["community"])["count"].sum()
community_count = pd.DataFrame(community_count).reset_index().sort_values('count', ascending=False)
community_count.head()
Out[26]:
community count
5 AUSTIN 77134
47 NEAR NORTH SIDE 56461
41 LOOP 47802
49 NEAR WEST SIDE 44469
52 NORTH LAWNDALE 43908
In [27]:
# STEP 1: Initialize the map
m = folium.Map(
    location=[41.8, -87.7],
    tiles='stamentoner',
    zoom_start=10.5
)

title_html = '''
             <h3 align="center" style="font-size:24px"><b>Choropleth of Crimes in Community, Chicago</b></h3>
             '''

# STEP 2: Add the community GeoJson to the map
folium.GeoJson(
    boundary,
    style_function=get_neighborhood_style,
    tooltip=folium.GeoJsonTooltip(['community'])
).add_to(m)

# STEP 3: Add the community count to the map
folium.Choropleth(
    geo_data=boundary,
    data=community_count,
    columns=['community','count'],
    key_on='feature.properties.community',
    fill_color='YlOrBr',
    fill_opacity=0.7,
    line_color="transparent",
    line_weight=0,
    highlight=True
).add_to(m)

m.get_root().html.add_child(folium.Element(title_html))
m.save("folium_choropleth_community_crimes.html")
m
Out[27]:
Make this Notebook Trusted to load map: File -> Trust Notebook

By folium, I'm able to create an interactive map to simply show the accumulated crime events in the community over 5 years. As shown on the map, the crime incidents were concentrated in several neighborhoods in the city and several neighborhoods near the southside.

2.2 Crime Count¶

In [28]:
crime_count = allcrime.groupby(["primary_type"])["count"].sum()
crime_count = pd.DataFrame(crime_count).reset_index().sort_values('count', ascending=True)
In [29]:
# Create the histogram
fig = px.bar(crime_count, 
             x="count", 
             y="primary_type",
             color="count", 
             color_continuous_scale='thermal',
             template="plotly_dark",
             title="Crime Count in Chicago (2015 - 2019)")
fig.update_layout(
    xaxis=dict(
        title='Count',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Crimes Type',
        titlefont_size=12,
        tickfont_size=6,
    )
)

fig.write_html("chart_crime_count.html")
# Show the plot
fig.show()

The top ten crimes in terms of crime records were theft,battery, criminal damage, assult, other offense, deceptive practice, narcotics, burglary, moto vehicle theft and robbery.

2.3 Location Count¶

In [30]:
location_count = allcrime.groupby(["location_description"])["count"].sum()
location_count = pd.DataFrame(location_count).reset_index().sort_values('count', ascending=False)
location_count = location_count.iloc[0:20]
In [31]:
# Create the histogram
fig = px.bar(location_count, 
             x="count", 
             y="location_description",
             color="count", 
             color_continuous_scale='thermal',
             template="plotly_dark",
             title="Top 20 Location of Crimes in Chicago (2015 - 2019)")
fig.update_layout(
    xaxis=dict(
        title='Count',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Location',
        titlefont_size=12,
        tickfont_size=6,
    )
)

fig.write_html("chart_crime_location.html")
# Show the plot
fig.show()

2.4 Community Crime Count¶

In [32]:
community_crime_count = allcrime.groupby(["primary_type","community"])["count"].sum()
community_crime_count = pd.DataFrame(community_crime_count).reset_index().sort_values('count', ascending=True)
community_crime_count.head()
Out[32]:
primary_type community count
1644 NON-CRIMINAL (SUBJECT SPECIFIED) GARFIELD RIDGE 0
1653 NON-CRIMINAL (SUBJECT SPECIFIED) KENWOOD 0
1652 NON-CRIMINAL (SUBJECT SPECIFIED) JEFFERSON PARK 0
1651 NON-CRIMINAL (SUBJECT SPECIFIED) IRVING PARK 0
1650 NON-CRIMINAL (SUBJECT SPECIFIED) HYDE PARK 0
In [33]:
fig = px.scatter(community_crime_count,
                 x="count", 
                 y="community", 
                 size="count", 
                 color="primary_type",
                 template="plotly_dark",
                 title="Crimes Happened in Community (2015 - 2019)",
                 marginal_x="histogram")
fig.update_layout(
    xaxis=dict(
        title='Count',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Community',
        titlefont_size=12,
        tickfont_size=6,
    )
)

fig.write_html("chart_crimes_community.html")
fig.show()

2.5 District Crime Count¶

In [34]:
district_count=allcrime.groupby(["district","primary_type"])["count"].sum()
district_count=pd.DataFrame(district_count).reset_index()
district_count.head()
Out[34]:
district primary_type count
0 1 ARSON 22
1 1 ASSAULT 3546
2 1 BATTERY 7699
3 1 BURGLARY 913
4 1 CONCEALED CARRY LICENSE VIOLATION 10
In [35]:
colormap1 = alt.Scale(domain=[0,5,25,125,250, 625,1250, 3125,6250, 12500, 15625],
                     range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
                     type='sqrt')

heatmap_d = alt.Chart(district_count).mark_rect().encode(
    alt.X('district:O', axis=alt.Axis(title="District", ticks=False)),
    alt.Y('primary_type:O',axis=alt.Axis(title="Crime", ticks=True)),
    alt.Color('count:Q',scale=colormap1),
    tooltip=['district:O', 'count:Q']
).properties(
    width=450,
    height=500,
    title={
      "text": ["Crimes By District (2015-2019)"], 
      "subtitle": ["in Chicago"]
    }
).interactive()

heatmap_d.save('heatmap_district.html')

heatmap_d
Out[35]:

2.6 Time Series¶

In [36]:
crime_annual=allcrime.groupby(["year_month","year","month"])["count"].sum()
crime_annual=pd.DataFrame(crime_annual).reset_index()
In [37]:
fig = px.bar(
    crime_annual,
    x="year_month",
    y="count",
    hover_data=["month", "count"],
    color="count",
    height=300,
    color_continuous_scale=px.colors.sequential.thermal,
    template="plotly_dark"
)
fig.update_layout(
    title_text='Histogram of Crimes By Month (2015 - 2019)',
    xaxis=dict(
        title='Month',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Community',
        titlefont_size=12,
        tickfont_size=6,
    )
)

fig.write_html("chart_histogram_of_crimes_by_month.html")
fig.show()
In [38]:
community_annual=allcrime.groupby(["year_month","year","month","community"])["count"].sum()
community_annual=pd.DataFrame(community_annual).reset_index()
community_annual
Out[38]:
year_month year month community count
0 2015-01 2015 1 ALBANY PARK 179
1 2015-01 2015 1 ARCHER HEIGHTS 85
2 2015-01 2015 1 ARMOUR SQUARE 89
3 2015-01 2015 1 ASHBURN 197
4 2015-01 2015 1 AUBURN GRESHAM 616
... ... ... ... ... ...
4615 2019-12 2019 12 WEST LAWN 146
4616 2019-12 2019 12 WEST PULLMAN 309
4617 2019-12 2019 12 WEST RIDGE 284
4618 2019-12 2019 12 WEST TOWN 558
4619 2019-12 2019 12 WOODLAWN 251

4620 rows × 5 columns

2.7 Time Heatmap¶

In [39]:
colormap2 = alt.Scale(domain=[0,20,40,60,80,100,200,400,800,1200],
                     range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
                     type='sqrt')

heatmap_t = alt.Chart(community_annual).mark_rect().encode(
    alt.X('year_month:O', axis=alt.Axis(title="Month", ticks=False)),
    alt.Y('community:O',axis=alt.Axis(title="Community", ticks=True)),
    alt.Color('count:Q',scale=colormap2),
    tooltip=['year:O', 'count:Q']
).properties(
    width=650,
    height=700,
    title={
      "text": ["Crimes By Month in Community (2015-2019)"], 
      "subtitle": ["in Chicago"]
    }
).interactive()

heatmap_t.save('heatmap_time.html')
heatmap_t
Out[39]:
In [40]:
hour_count=allcrime.groupby(["hour","primary_type"])["count"].sum()
hour_count=pd.DataFrame(hour_count).reset_index().sort_values('count', ascending=False)
hour_count.head()
Out[40]:
hour primary_type count
440 12 THEFT 20729
644 18 THEFT 19984
542 15 THEFT 19897
610 17 THEFT 19855
508 14 THEFT 19411
In [41]:
colormap3 = alt.Scale(domain=[0,5,25,125,250,625,1250,3125,6250, 12500, 15625],
                     range=['#461e5c','#72369d','#9d4edd','#b67be6','#cea7ee','#ffcf70','#ffc243','#ffad03','#f56a00','#bf5028'],
                     type='sqrt')

heatmap_h = alt.Chart(hour_count).mark_rect().encode(
    alt.X('hour:O', axis=alt.Axis(title="Time", ticks=False)),
    alt.Y('primary_type:O',axis=alt.Axis(title="Crime", ticks=True)),
    alt.Color('count:Q',scale=colormap3),
    tooltip=['count:Q']
).properties(
    width=450,
    height=500,
    title={
      "text": ["Crimes By Hour (2015-2019)"], 
      "subtitle": ["in Chicago"]
    }
).interactive()

heatmap_h.save('heatmap_hour.html')
heatmap_h
Out[41]:

2.8 Arrest Rate¶

In this step, I get the average number of arrested perpetrators then calculate the arrest rate.

In [42]:
# groupby arrest then calculate the mean value
arrest = allcrime.groupby('primary_type')['arrest'].mean().reset_index()
# remove excess decimal noise by rounding and then multiply each value by 100 to get a percentage
arrest['arrest_rate'] = arrest['arrest'].round(4)*100
arrest = arrest.sort_values('arrest_rate', ascending=False)
arrest.tail(n=15)
Out[42]:
primary_type arrest arrest_rate
23 OFFENSE INVOLVING CHILDREN 0.151739 15.17
12 HUMAN TRAFFICKING 0.142857 14.29
31 STALKING 0.136364 13.64
7 CRIMINAL SEXUAL ASSAULT 0.125224 12.52
14 INTIMIDATION 0.108516 10.85
0 ARSON 0.104429 10.44
32 THEFT 0.103988 10.40
5 CRIM SEXUAL ASSAULT 0.091111 9.11
29 ROBBERY 0.084881 8.49
17 MOTOR VEHICLE THEFT 0.079040 7.90
15 KIDNAPPING 0.069793 6.98
6 CRIMINAL DAMAGE 0.060331 6.03
20 NON-CRIMINAL 0.059701 5.97
9 DECEPTIVE PRACTICE 0.053185 5.32
3 BURGLARY 0.052050 5.20
In [43]:
# Create the histogram
fig = px.bar(arrest, 
             x="arrest_rate", 
             y="primary_type",
             color="arrest_rate", 
             color_continuous_scale='thermal',
             template="plotly_dark",
             title="Crimes Arrest Rate (2015 - 2019)")
fig.update_layout(
    xaxis=dict(
        title='Rate(%)',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Crimes Type',
        titlefont_size=12,
        tickfont_size=6,
    )
)

# Show the plot
fig.write_html("chart_crime_arrest_rate.html")
fig.show()

2.9 Domestic Violence¶

In [44]:
# groupby domestic violence then calculate the mean value
domestic = allcrime.groupby('primary_type')['domestic'].mean().reset_index()
domestic = domestic.sort_values('domestic', ascending=False)
# get the percent of domestic violence
domestic['domestic_rate'] = domestic['domestic'].round(4)*100
domestic.head()
Out[44]:
primary_type domestic domestic_rate
21 NON-CRIMINAL (SUBJECT SPECIFIED) 0.833333 83.33
23 OFFENSE INVOLVING CHILDREN 0.537047 53.70
2 BATTERY 0.487777 48.78
31 STALKING 0.456710 45.67
25 OTHER OFFENSE 0.318453 31.85
In [45]:
# Create the histogram
fig = px.bar(domestic[0:20], 
             x="domestic_rate", 
             y="primary_type",
             color="domestic_rate", 
             color_continuous_scale='thermal',
             template="plotly_dark",
             title="Top 20 Crimes Domestic Violence Rate (2015 - 2019)")
fig.update_layout(
    xaxis=dict(
        title='Rate(%)',
        tickfont_size=12
    ),
    yaxis=dict(
        title='Crimes Type',
        titlefont_size=12,
        tickfont_size=6,
    )
)

# Show the plot
fig.write_html("chart_domestic_violence_rate.html")
fig.show()

Part 3 - Feature Engineering¶

3.1 Density Analysis¶

In [46]:
import geoplot as gplt

crimes = ['ROBBERY','THEFT','BURGLARY','WEAPONS VIOLATION','NARCOTICS','ASSAULT','HOMICIDE','SEX OFFENSE','KIDNAPPING']
cmap = sns.cubehelix_palette(light=1, as_cmap=True)

fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(12,12))
for i , crimes in enumerate(np.random.choice(crimes, size=9, replace=False)):
    data = allcrime.loc[allcrime['primary_type'] == crimes ]
    ax = fig.add_subplot(3, 3, i+1)
    gplt.polyplot(boundary, 
                  ax=ax,
                  edgecolor="grey",
                  linewidth=0.2,
                  facecolor="white")
    gplt.kdeplot(data,
                 shade=True,
                 shade_lowest=False,
                 cmap=cmap,
                 alpha=0.5,
                 ax=ax)
    ax.set_title(crimes)
    
plt.suptitle('Density of Different Crimes in Chicago')
plt.figure(facecolor='black')
plt.savefig('density_of_different_crimes',bbox_inches='tight')
plt.show()
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geoplot/geoplot.py:885: ShapelyDeprecationWarning:

Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/distributions.py:1718: UserWarning:

`shade_lowest` is now deprecated in favor of `thresh`. Setting `thresh=0.05`, but please update your code.

<Figure size 640x480 with 0 Axes>

3.2 Cluster Analysis¶

In [47]:
# group by community
cluster1=allcrime.groupby(["community"])["count"].sum()

# add arrest rate
cluster2=allcrime.groupby(['community'])['arrest'].mean()

# add arrest domestic violence rate
cluster3=allcrime.groupby(['community'])['domestic'].mean()

cluster4 = allcrime.groupby(["community","primary_type"])["count"].sum().reset_index()
cluster4 = pd.DataFrame(cluster4).rename(columns={'count': 'crime_count'})
cluster4 =pd.pivot(
    cluster4,
    index=['community'], 
    columns = 'primary_type',
    values = 'crime_count'
)

# transform dataframe from long to wide, add month column
cluster5 = allcrime.groupby(["community","month"])["count"].sum().reset_index()
cluster5 = pd.DataFrame(cluster5).rename(columns={'count': 'month_count'})
cluster5 =pd.pivot(
    cluster5,
    index=['community'], 
    columns = 'month',
    values = 'month_count'
)

cluster6 = allcrime.groupby(["community","dow"])["count"].sum().reset_index()
cluster6 = pd.DataFrame(cluster6).rename(columns={'count': 'week_count'})
cluster6 =pd.pivot(
    cluster6,
    index=['community'], 
    columns = 'dow',
    values = 'week_count'
)
In [48]:
cluster_select=allcrime.groupby(["description","community"])['count'].sum().reset_index()
cluster_select=pd.DataFrame(cluster_select)
knife=cluster_select[cluster_select["description"].str.contains('KNIFE')]
gun=cluster_select[cluster_select["description"].str.contains('GUN')]
weapon=cluster_select[cluster_select["description"].str.contains('WEAPON')]
cluster7=pd.concat([knife, gun, weapon]).rename(columns={'count': 'weapon_count'})
cluster7=cluster7.groupby(["community"])['weapon_count'].sum().reset_index()
In [49]:
# merge together and get final cluster
cluster=pd.merge(cluster1,cluster2,on="community")
cluster=pd.merge(cluster,cluster3,on="community")
cluster=pd.merge(cluster,cluster4,on="community")
cluster=pd.merge(cluster,cluster5,on="community")
cluster=pd.merge(cluster,cluster6,on="community")
cluster=pd.merge(cluster,cluster7,on="community")
cluster=cluster.dropna()
cluster.head()
Out[49]:
community count arrest domestic ARSON ASSAULT BATTERY BURGLARY CONCEALED CARRY LICENSE VIOLATION CRIM SEXUAL ASSAULT CRIMINAL DAMAGE CRIMINAL SEXUAL ASSAULT CRIMINAL TRESPASS DECEPTIVE PRACTICE GAMBLING HOMICIDE HUMAN TRAFFICKING INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION KIDNAPPING LIQUOR LAW VIOLATION MOTOR VEHICLE THEFT NARCOTICS NON - CRIMINAL NON-CRIMINAL NON-CRIMINAL (SUBJECT SPECIFIED) OBSCENITY OFFENSE INVOLVING CHILDREN OTHER NARCOTIC VIOLATION OTHER OFFENSE PROSTITUTION PUBLIC INDECENCY PUBLIC PEACE VIOLATION ROBBERY SEX OFFENSE STALKING THEFT WEAPONS VIOLATION 1 10 11 12 2 3 4 5 6 7 8 9 Friday Monday Saturday Sunday Thursday Tuesday Wednesday weapon_count
0 ALBANY PARK 11908 0.130752 0.141250 18 757 2236 703 1 59 1640 11 227 620 2 14 1 15 9 8 11 698 280 0 2 0 2 104 0 700 11 1 56 587 80 6 2909 140 934 1062 922 979 849 985 977 1067 1047 1002 1052 1032 1634 1756 1673 1708 1704 1735 1698 1116
1 ARCHER HEIGHTS 4310 0.166357 0.137355 8 293 709 338 1 16 595 6 79 259 0 9 0 5 6 10 4 256 128 0 1 0 4 38 0 225 12 1 37 137 25 4 1048 56 350 365 370 336 371 351 328 400 359 337 375 368 663 627 610 524 578 669 639 299
2 ARMOUR SQUARE 5081 0.161582 0.101752 2 344 939 213 0 16 507 3 180 422 2 5 0 11 5 1 16 214 100 0 1 0 0 21 0 176 1 0 16 434 21 3 1361 67 392 422 429 399 307 362 369 457 477 509 492 466 783 707 757 767 681 682 704 653
3 ASHBURN 11614 0.126830 0.171948 21 905 2055 784 0 42 1669 9 246 791 10 30 0 20 12 15 5 786 254 3 0 0 12 122 1 882 0 0 71 464 40 12 2219 134 982 1011 1006 1013 786 893 911 972 997 995 1071 977 1775 1696 1523 1548 1710 1666 1696 970
4 AUBURN GRESHAM 37599 0.245485 0.240060 71 3454 8470 2016 22 190 4402 22 1016 1698 28 111 1 285 9 32 23 1328 2101 0 1 0 2 396 0 2883 61 1 275 1668 90 28 5727 1188 2888 3372 2916 2851 2627 2968 3085 3440 3247 3521 3438 3246 5463 5383 5309 5295 5382 5389 5378 4965
In [50]:
kmeans=KMeans(n_clusters=5)

scaled=scaler.fit_transform(cluster[
    ['count',
     'arrest',
     'domestic',
     'count',
     'arrest',
     'domestic',
     'weapon_count',
     'Friday',
     'Monday',
     'Saturday',
     'Sunday',
     'Thursday',
     'Tuesday',
     'Wednesday']
])
scaled.mean(axis=0)
scaled.std(axis=0)
Out[50]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [51]:
kmeans.fit(scaled)
cluster['label']=kmeans.labels_
In [52]:
kmeans.inertia_
Out[52]:
244.55318408291294
In [53]:
cluster_geo=boundary.merge(cluster,on="community")
cluster_geo=cluster_geo.drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len']).rename(columns={'area_numbe': 'community_area'})
cluster_geo=cluster_geo.to_crs(3857)
cluster_geo['label']=cluster_geo['label'].fillna(-1)
cluster_geo.head()
Out[53]:
community community_area geometry count arrest domestic ARSON ASSAULT BATTERY BURGLARY CONCEALED CARRY LICENSE VIOLATION CRIM SEXUAL ASSAULT CRIMINAL DAMAGE CRIMINAL SEXUAL ASSAULT CRIMINAL TRESPASS DECEPTIVE PRACTICE GAMBLING HOMICIDE HUMAN TRAFFICKING INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION KIDNAPPING LIQUOR LAW VIOLATION MOTOR VEHICLE THEFT NARCOTICS NON - CRIMINAL NON-CRIMINAL NON-CRIMINAL (SUBJECT SPECIFIED) OBSCENITY OFFENSE INVOLVING CHILDREN OTHER NARCOTIC VIOLATION OTHER OFFENSE PROSTITUTION PUBLIC INDECENCY PUBLIC PEACE VIOLATION ROBBERY SEX OFFENSE STALKING THEFT WEAPONS VIOLATION 1 10 11 12 2 3 4 5 6 7 8 9 Friday Monday Saturday Sunday Thursday Tuesday Wednesday weapon_count label
0 DOUGLAS 35 MULTIPOLYGON (((-9752604.951 5137743.450, -975... 12994 0.192012 0.145375 5 1095 2658 262 3 64 1202 12 516 887 3 20 1 27 11 9 3 477 353 1 2 0 3 80 0 925 1 1 69 621 55 11 3481 136 1051 1242 1063 965 900 1057 1023 1100 1144 1151 1120 1178 1896 1981 1675 1626 1880 2008 1928 1108 0
1 OAKLAND 36 MULTIPOLYGON (((-9750713.852 5133595.674, -975... 3228 0.091388 0.243185 5 309 612 101 1 18 455 3 84 178 0 9 0 3 1 3 0 155 37 0 1 0 3 33 0 302 0 0 7 119 14 3 720 52 249 294 223 237 188 245 261 290 321 309 348 263 457 466 491 468 437 461 448 314 4
2 FULLER PARK 37 MULTIPOLYGON (((-9754793.199 5131350.020, -975... 4362 0.249427 0.149473 7 377 835 113 1 23 440 6 188 246 1 16 1 49 5 3 0 198 244 1 0 1 2 44 0 220 0 0 22 248 15 3 938 115 309 381 327 299 283 327 368 385 451 428 376 428 647 618 646 634 598 633 586 613 4
3 GRAND BOULEVARD 38 MULTIPOLYGON (((-9752334.139 5133578.410, -975... 16251 0.179927 0.207987 10 1329 3616 645 1 67 1909 13 322 856 4 41 2 72 5 10 13 772 671 0 0 0 9 132 0 1162 63 1 69 766 57 14 3356 264 1137 1402 1336 1239 1079 1270 1281 1470 1497 1537 1574 1429 2378 2339 2282 2192 2374 2335 2351 1765 4
4 KENWOOD 39 MULTIPOLYGON (((-9750713.852 5133595.674, -975... 7018 0.105443 0.188800 7 565 1281 289 0 41 830 10 184 528 0 14 0 6 7 3 0 323 103 0 1 0 4 44 0 523 0 0 20 367 33 4 1770 61 596 636 583 585 459 536 515 640 619 669 614 566 1048 1070 958 916 1018 1030 978 659 4
In [54]:
cluster_geo.groupby("label").size()
Out[54]:
label
0    31
1    17
2     5
3     1
4    23
dtype: int64
In [55]:
cluster_geo.groupby("label")['count'].mean()
Out[55]:
label
0    10043.032258
1    30559.000000
2    42966.600000
3    77134.000000
4     7862.782609
Name: count, dtype: float64
In [56]:
import h3
import libpysal
from cenpy import products
from tobler.util import h3fy
from tobler.area_weighted import area_interpolate

import contextily as ctx
from cenpy import products

cluster_geo_hex = h3fy(cluster_geo, resolution=8)
cluster_geo_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['label'])
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

In [57]:
cluster_geo['label'] = cluster_geo['label'].astype('category')
In [58]:
fig, axs = plt.subplots(1,2, figsize=(18,10))

cluster_geo.plot('label', cmap='magma', alpha=0.6, linewidth=0.6, edgecolor='white', ax=axs[0],legend=True)
cluster_geo_hex.plot('label', cmap='magma', alpha=0.6, linewidth=0.6, edgecolor='white', ax=axs[1], legend=True, legend_kwds=dict(loc='upper right'),scheme="Quantiles", k=5)

axs[0].set_title('Community Clusters')
axs[1].set_title('Hex Clusters')

for ax in axs:
    ax.axis('off')
    ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.Toner)
plt.suptitle('Spatial Cluster Analysis of Crime in Chicago', fontsize=20)
plt.savefig('spatial_cluster_analysis',bbox_inches='tight')

3.3 Amenities Features¶

In [59]:
crime_3857 = crime_geo.to_crs(3857)
cluster_geo = cluster_geo.to_crs(3857)
In [60]:
def get_xy_from_geometry(df):
    """
    Return a numpy array with two columns, where the 
    first holds the `x` geometry coordinate and the second 
    column holds the `y` geometry coordinate
    
    Note: this works with both Point() and Polygon() objects.
    """
    # NEW: use the centroid.x and centroid.y to support Polygon() and Point() geometries 
    x = df.geometry.centroid.x
    y = df.geometry.centroid.y
    
    return np.column_stack((x, y)) # stack as columns

# Extract x/y for sales
crimeXY = get_xy_from_geometry(crime_3857)
crime_polygon = get_xy_from_geometry(cluster_geo)

(1) Depository Locations¶

In [61]:
#========crime point=======

# Get the data
url = "https://data.cityofchicago.org/resource/g5wi-brhg.geojson"
deposit = gpd.read_file(url)
deposit = deposit[deposit.geometry.notnull()]

# Get the X/Y
depositXY = get_xy_from_geometry(deposit.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(depositXY)
depositDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistDeposit"] = np.log10(depositDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
depositXY = get_xy_from_geometry(deposit.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(depositXY)
depositDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistDeposit"] = np.log10(depositDists.mean(axis=1))

(2) Abandon Buildings¶

In [62]:
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/kc9i-wq85.geojson"
abbuilding = gpd.read_file(url)
abbuilding = abbuilding[abbuilding.geometry.notnull()]

# Get the X/Y
abbuildingXY = get_xy_from_geometry(abbuilding.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(abbuildingXY)
abbuildingDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistAbbuilding"] = np.log10(abbuildingDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
abbuildingXY = get_xy_from_geometry(abbuilding.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(abbuildingXY)
abbuildingDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistAbbuilding"] = np.log10(abbuildingDists.mean(axis=1))

(3) Schools¶

In [63]:
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/tz49-n8ze.geojson"
school = gpd.read_file(url)
school = school[school.geometry.notnull()]

# Get the X/Y
schoolXY = get_xy_from_geometry(school.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(schoolXY)
schoolDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistSchool"] = np.log10(schoolDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
schoolXY = get_xy_from_geometry(school.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(schoolXY)
schoolDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistSchool"] = np.log10(schoolDists.mean(axis=1))

(4) Grocery Stores¶

In [64]:
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/53t8-wyrc.geojson"
grocery = gpd.read_file(url)
grocery = grocery[grocery.geometry.notnull()]

# Get the X/Y
groceryXY = get_xy_from_geometry(grocery.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(groceryXY)
groceryDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistGrocery"] = np.log10(groceryDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
groceryXY = get_xy_from_geometry(grocery.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(groceryXY)
groceryDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistGrocery"] = np.log10(groceryDists.mean(axis=1))

(5) Subway Stations¶

In [65]:
chicago_outline = boundary.geometry.unary_union
chicago_outline
Out[65]:
b''
In [66]:
# Get the subway stops within the city limits
subway = ox.geometries_from_polygon(chicago_outline, tags={"station": "subway"})
# Convert to 3857 (meters)
subway = subway.to_crs(epsg=3857)
In [67]:
#========crime point=======
# STEP 1: x/y coordinates of subway stops (in EPGS=3857)
subwayXY = get_xy_from_geometry(subway.to_crs(epsg=3857))

# STEP 2: Initialize the algorithm, use  𝑘=5  to get the distance to the nearest stop.
nbrs = NearestNeighbors(n_neighbors=5)

# STEP 3: Fit the algorithm on the "neighbors" dataset
nbrs.fit(subwayXY)

# STEP 4: Get distances for sale to neighbors
subwayDists, subwayIndices = nbrs.kneighbors(crimeXY)

# STEP 5: add back to the original dataset
crime_3857["logDistSubway"] = np.log10(subwayDists.mean(axis=1))


#========community polygon=======
# STEP 1: x/y coordinates of subway stops (in EPGS=3857)
subwayXY = get_xy_from_geometry(subway.to_crs(epsg=3857))

# STEP 2: Initialize the algorithm, use  𝑘=5  to get the distance to the nearest stop.
nbrs = NearestNeighbors(n_neighbors=5)

# STEP 3: Fit the algorithm on the "neighbors" dataset
nbrs.fit(subwayXY)

# STEP 4: Get distances for sale to neighbors
subwayDists, subwayIndices = nbrs.kneighbors(crime_polygon)

# STEP 5: add back to the original dataset
cluster_geo["logDistSubway"] = np.log10(subwayDists.mean(axis=1))

(6) Retails and Commercials¶

In [68]:
#========crime point=======
# Get the subway stops within the city limits
retail = ox.geometries_from_polygon(chicago_outline, tags = {'landuse':['retail','commercial']})
# Convert to 3857 (meters)
retail = retail.to_crs(epsg=3857)

# Get the X/Y
retailXY = get_xy_from_geometry(retail.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(retailXY)
retailDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistRetail"] = np.log10(retailDists.mean(axis=1))



#========community polygon=======
# Get the X/Y
retailXY = get_xy_from_geometry(retail.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(retailXY)
retailDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistRetail"] = np.log10(retailDists.mean(axis=1))

(7) Landmarks¶

In [69]:
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/tdab-kixi.geojson"
landmark = gpd.read_file(url)
#landmark = landmark[landmark.geometry.notnull()]

# Get the X/Y
landmarkXY = get_xy_from_geometry(landmark.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(landmarkXY)
landmarkDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistLandmark"] = np.log10(landmarkDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
landmarkXY = get_xy_from_geometry(landmark.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(landmarkXY)
landmarkDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistLandmark"] = np.log10(landmarkDists.mean(axis=1))

(8) Shotspotter Alerts¶

In [70]:
#========crime point=======
# Get the data
url = "https://data.cityofchicago.org/resource/3h7q-7mdb.geojson"
alerts = gpd.read_file(url)
alerts = alerts[alerts.geometry.notnull()]

# Get the X/Y
alertsXY = get_xy_from_geometry(alerts.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(alertsXY)
alertsDists, _ = nbrs.kneighbors(crimeXY)

# Add the new feature
crime_3857["logDistAlerts"] = np.log10(alertsDists.mean(axis=1))


#========community polygon=======
# Get the X/Y
alertsXY = get_xy_from_geometry(alerts.to_crs(epsg=3857))

# Run the k nearest algorithm
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(alertsXY)
alertsDists, _ = nbrs.kneighbors(crime_polygon)

# Add the new feature
cluster_geo["logDistAlerts"] = np.log10(alertsDists.mean(axis=1))

3.4 Hexgrids¶

In [71]:
cluster_geo_hex = h3fy(cluster_geo, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

In [72]:
# Plot all features in the same figure
fig, axs = plt.subplots(ncols=2,nrows=4, figsize=(16,28))

# logDistDeposit
ax=axs[0][0]
logDistDeposit_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistDeposit'])
logDistDeposit_hex = logDistDeposit_hex.to_crs(3857)
logDistDeposit_hex.plot('logDistDeposit', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Despository Locations',fontsize=12)

# logDistAbbuilding
ax=axs[0][1]
logDistAbbuilding_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistAbbuilding'])
logDistAbbuilding_hex = logDistAbbuilding_hex.to_crs(3857)
logDistAbbuilding_hex.plot('logDistAbbuilding', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Abandon Buildings',fontsize=12)

# logDistSchool
ax=axs[1][0]
logDistSchool_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistSchool'])
logDistSchool_hex = logDistSchool_hex.to_crs(3857)
logDistSchool_hex.plot('logDistSchool', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Schools',fontsize=12)

# logDistGrocery
ax=axs[1][1]
logDistGrocery_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistGrocery'])
logDistGrocery_hex = logDistGrocery_hex.to_crs(3857)
logDistGrocery_hex.plot('logDistGrocery', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Grocery Stores',fontsize=12)

# logDistSubway
ax=axs[2][0]
logDistSubway_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistSubway'])
logDistSubway_hex = logDistSubway_hex.to_crs(3857)
logDistSubway_hex.plot('logDistSubway', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Subway Stations',fontsize=12)

# logDistRetail
ax=axs[2][1]
logDistRetail_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistRetail'])
logDistRetail_hex = logDistRetail_hex.to_crs(3857)
logDistRetail_hex.plot('logDistRetail', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Retails and Commercials',fontsize=12)

# logDistLandmark
ax=axs[3][0]
logDistLandmark_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistLandmark'])
logDistLandmark_hex = logDistLandmark_hex.to_crs(3857)
logDistLandmark_hex.plot('logDistLandmark', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Landmarks',fontsize=12)


# logDistAlerts
ax=axs[3][1]
logDistAlerts_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['logDistAlerts'])
logDistAlerts_hex = logDistAlerts_hex.to_crs(3857)
logDistAlerts_hex.plot('logDistAlerts', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Nearest Neighbors - Log Distance of Shotspotter Alerts',fontsize=12)

plt.savefig('hexgrids_nn',bbox_inches='tight')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

In [73]:
cluster_geo['community_area'] = cluster_geo['community_area'].astype("Int64")
cluster_geo = pd.merge(cluster_geo, census, on = "community_area")
In [74]:
# The overall crime rate of a county is calculated by dividing the total number of Index crimes submitted by police agencies in each county by the county’s population
# Then multiply the result by 100,000. 
# The U.S. Census Bureau is the source of county population data.
cluster_geo['count'] = cluster_geo['count'].astype("Int64")
cluster_geo['crime_index'] = cluster_geo['count'] / 2740000 * 100000
cluster_geo['people_risk_index'] = cluster_geo['count'] / (cluster_geo['percent_aged_under_18_or_over_64'] * 2740000) * 100000
cluster_geo['weapon_rate'] = cluster_geo['weapon_count']/ cluster_geo['count']

cluster_geo['weapon_rate'] = cluster_geo['weapon_rate'].apply(pd.to_numeric)
cluster_geo['crime_index'] = cluster_geo['crime_index'].apply(pd.to_numeric)
cluster_geo['people_risk_index'] = cluster_geo['people_risk_index'].apply(pd.to_numeric)
cluster_geo.head()
Out[74]:
community community_area geometry count arrest domestic ARSON ASSAULT BATTERY BURGLARY CONCEALED CARRY LICENSE VIOLATION CRIM SEXUAL ASSAULT CRIMINAL DAMAGE CRIMINAL SEXUAL ASSAULT CRIMINAL TRESPASS DECEPTIVE PRACTICE GAMBLING HOMICIDE HUMAN TRAFFICKING INTERFERENCE WITH PUBLIC OFFICER INTIMIDATION KIDNAPPING LIQUOR LAW VIOLATION MOTOR VEHICLE THEFT NARCOTICS NON - CRIMINAL NON-CRIMINAL NON-CRIMINAL (SUBJECT SPECIFIED) OBSCENITY OFFENSE INVOLVING CHILDREN OTHER NARCOTIC VIOLATION OTHER OFFENSE PROSTITUTION PUBLIC INDECENCY PUBLIC PEACE VIOLATION ROBBERY SEX OFFENSE STALKING THEFT WEAPONS VIOLATION 1 10 11 12 2 3 4 5 6 7 8 9 Friday Monday Saturday Sunday Thursday Tuesday Wednesday weapon_count label logDistDeposit logDistAbbuilding logDistSchool logDistGrocery logDistSubway logDistRetail logDistLandmark logDistAlerts community_area_name percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index crime_index people_risk_index weapon_rate
0 DOUGLAS 35 MULTIPOLYGON (((-9752604.951 5137743.450, -975... 12994 0.192012 0.145375 5 1095 2658 262 3 64 1202 12 516 887 3 20 1 27 11 9 3 477 353 1 2 0 3 80 0 925 1 1 69 621 55 11 3481 136 1051 1242 1063 965 900 1057 1023 1100 1144 1151 1120 1178 1896 1981 1675 1626 1880 2008 1928 1108 0 3.103828 3.518327 2.745583 2.960215 3.293262 2.758428 2.391839 3.068095 Douglas 1.8 29.6 18.2 14.3 30.7 23791 47.0 474.233577 15.447348 0.085270
1 OAKLAND 36 MULTIPOLYGON (((-9750713.852 5133595.674, -975... 3228 0.091388 0.243185 5 309 612 101 1 18 455 3 84 178 0 9 0 3 1 3 0 155 37 0 1 0 3 33 0 302 0 0 7 119 14 3 720 52 249 294 223 237 188 245 261 290 321 309 348 263 457 466 491 468 437 461 448 314 4 3.310177 3.500816 3.025448 3.136585 3.413476 2.857741 2.959350 2.940911 Oakland 1.3 39.7 28.7 18.4 40.4 19252 78.0 117.810219 2.916095 0.097274
2 FULLER PARK 37 MULTIPOLYGON (((-9754793.199 5131350.020, -975... 4362 0.249427 0.149473 7 377 835 113 1 23 440 6 188 246 1 16 1 49 5 3 0 198 244 1 0 1 2 44 0 220 0 0 22 248 15 3 938 115 309 381 327 299 283 327 368 385 451 428 376 428 647 618 646 634 598 633 586 613 4 3.478687 2.817223 2.926504 3.185645 3.185514 2.244567 3.102937 2.718943 Fuller Park 3.2 51.2 33.9 26.6 44.9 10432 97.0 159.197080 3.545592 0.140532
3 GRAND BOULEVARD 38 MULTIPOLYGON (((-9752334.139 5133578.410, -975... 16251 0.179927 0.207987 10 1329 3616 645 1 67 1909 13 322 856 4 41 2 72 5 10 13 772 671 0 0 0 9 132 0 1162 63 1 69 766 57 14 3356 264 1137 1402 1336 1239 1079 1270 1281 1470 1497 1537 1574 1429 2378 2339 2282 2192 2374 2335 2351 1765 4 3.241109 3.121504 2.892320 2.805637 3.054997 2.562950 2.731433 2.856737 Grand Boulevard 3.3 29.3 24.3 15.9 39.5 23472 57.0 593.102190 15.015245 0.108609
4 KENWOOD 39 MULTIPOLYGON (((-9750713.852 5133595.674, -975... 7018 0.105443 0.188800 7 565 1281 289 0 41 830 10 184 528 0 14 0 6 7 3 0 323 103 0 1 0 4 44 0 523 0 0 20 367 33 4 1770 61 596 636 583 585 459 536 515 640 619 669 614 566 1048 1070 958 916 1018 1030 978 659 4 2.637666 3.548489 2.595956 2.929183 3.466788 2.487147 2.820633 2.996621 Kenwood 2.4 21.7 15.7 11.3 35.4 35911 26.0 256.131387 7.235350 0.093901
In [75]:
cluster_geo_hex = h3fy(cluster_geo, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

In [76]:
# Plot all features in the same figure
fig, axs = plt.subplots(ncols=2,nrows=3, figsize=(12,18))

# percent_of_housing_crowded
ax=axs[0][0]
percent_of_housing_crowded_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['crime_index'])
percent_of_housing_crowded_hex = percent_of_housing_crowded_hex.to_crs(3857)
percent_of_housing_crowded_hex.plot('crime_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Crime Rate',fontsize=12)

# percent_households_below_poverty
ax=axs[0][1]
percent_households_below_poverty_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['people_risk_index'])
percent_households_below_poverty_hex = percent_households_below_poverty_hex.to_crs(3857)
percent_households_below_poverty_hex.plot('people_risk_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Risk Rate of People (under 18 or over 64)',fontsize=12)

# per_capita_income_
ax=axs[1][0]
percent_of_housing_crowded_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['percent_of_housing_crowded'])
percent_of_housing_crowded_hex = percent_of_housing_crowded_hex.to_crs(3857)
percent_of_housing_crowded_hex.plot('percent_of_housing_crowded', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Percent of Housing Crowded',fontsize=12)

# hardship_index
ax=axs[1][1]
hardship_index_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['hardship_index'])
hardship_index_hex = hardship_index_hex.to_crs(3857)
hardship_index_hex.plot('hardship_index', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Hardship Index',fontsize=12)


ax=axs[2][0]
count_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['domestic'])
count_hex = count_hex.to_crs(3857)
count_hex.plot('domestic', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Domestic Rate',fontsize=12)


ax=axs[2][1]
weapon_hex = area_interpolate(source_df=cluster_geo, target_df=cluster_geo_hex, intensive_variables=['weapon_rate'])
weapon_hex = weapon_hex.to_crs(3857)
weapon_hex.plot('weapon_rate', cmap='magma_r', alpha=0.6, linewidth=0.4, edgecolor="white", ax=ax, legend=True)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Weapon Rate',fontsize=12)


plt.savefig('hexgrids_features',bbox_inches='tight')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

3.5 Correlation Matrix¶

In [77]:
crime_3857['count'] = crime_3857['count'].astype("Int64")
crime_3857['crime_index'] = crime_3857['count'] / 2740000 * 100000
crime_3857['people_risk_index'] = crime_3857['count'] / (cluster_geo['percent_aged_under_18_or_over_64'] * 2740000) * 100000
crime_3857['weapon_rate'] = crime_3857['weapon_count']/ cluster_geo['count']

crime_3857['weapon_rate'] = crime_3857['weapon_rate'].apply(pd.to_numeric).replace(np.nan, 0)
crime_3857['crime_index'] = crime_3857['crime_index'].apply(pd.to_numeric).replace(np.nan, 0)
crime_3857['people_risk_index'] = crime_3857['people_risk_index'].apply(pd.to_numeric).replace(np.nan, 0)
In [78]:
#select the columns I want to use
feature_cols = [
    'community',
    'month_count',
    'month',
    'dow',
    'week_count',
    'weapon_count',
    'count',
    'crime_index',
    'weapon_rate',
    'people_risk_index',
    'domestic',
    'arrest',
    'percent_of_housing_crowded',
    'percent_households_below_poverty',
    'percent_aged_16_unemployed',
    'percent_aged_25_without_high_school_diploma',
    'percent_aged_under_18_or_over_64',
    'per_capita_income_',
    'hardship_index',
    "logDistDeposit",
    "logDistAbbuilding",
    "logDistSchool",
    "logDistGrocery",
    "logDistSubway",
    "logDistRetail",
    "logDistAlerts",
    "logDistLandmark"]
In [79]:
crime = crime_3857[feature_cols]
crime
Out[79]:
community month_count month dow week_count weapon_count count crime_index weapon_rate people_risk_index domestic arrest percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index logDistDeposit logDistAbbuilding logDistSchool logDistGrocery logDistSubway logDistRetail logDistAlerts logDistLandmark
0 ROGERS PARK 1378 1 Friday 2583 1309 18144 662.189781 0.100739 21.569700 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921
1 ROGERS PARK 1378 1 Monday 2744 1309 18144 662.189781 0.405514 16.390836 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921
2 ROGERS PARK 1378 1 Saturday 2580 1309 18144 662.189781 0.300092 14.748102 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921
3 ROGERS PARK 1378 1 Sunday 2549 1309 18144 662.189781 0.080549 16.764298 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921
4 ROGERS PARK 1378 1 Thursday 2522 1309 18144 662.189781 0.186520 18.705926 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6463 EDGEWATER 1081 9 Saturday 1739 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576
6464 EDGEWATER 1081 9 Sunday 1591 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576
6465 EDGEWATER 1081 9 Thursday 1738 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576
6466 EDGEWATER 1081 9 Tuesday 1756 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576
6467 EDGEWATER 1081 9 Wednesday 1786 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576

6468 rows × 27 columns

In [80]:
plt.figure(figsize=(10,8))
sns.set(font_scale=0.6)
sns.heatmap(crime.corr(), cmap="magma", annot=True, vmin=-1, vmax=1, alpha=0.8, linewidths=0.5, annot_kws={'size': 6})

plt.savefig('correlation_matrix')
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/seaborn/rcmod.py:400: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

Part 4 - Predictive Modeling of Crime Count¶

In [81]:
# Numerical columns
num_cols = [
    'percent_of_housing_crowded',
    'percent_households_below_poverty',
    'percent_aged_16_unemployed',
    'percent_aged_25_without_high_school_diploma',
    'percent_aged_under_18_or_over_64',
    'per_capita_income_',
    'hardship_index',
    "logDistDeposit",
    "logDistAbbuilding",
    "logDistSchool",
    "logDistGrocery",
    "logDistSubway",
    "logDistRetail",
    "logDistLandmark",
    "logDistAlerts"
]

# Categorical columns
cat_cols = [
    'dow',
    "community",
    ]
In [82]:
# Split the data 
train_set, test_set = train_test_split(crime, test_size=0.3, random_state=42)

# the target labels: log of count
y_train = np.log(crime['crime_index'])
y_test = np.log(crime['crime_index'])
In [83]:
common_samples = set(train_set.index).intersection(set(y_train.index))
In [84]:
train_set = train_set.loc[common_samples]
y_train = y_train.loc[common_samples]
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4101391604.py:1: FutureWarning:

Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4101391604.py:2: FutureWarning:

Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

In [85]:
common_samples2 = set(test_set.index).intersection(set(y_test.index))
In [86]:
test_set = test_set.loc[common_samples2]
y_test = y_test.loc[common_samples2]
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/2220761477.py:1: FutureWarning:

Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/2220761477.py:2: FutureWarning:

Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

In [87]:
# Set up the column transformer with two transformers
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

# Initialize the pipeline
pipe = make_pipeline(
    transformer, RandomForestRegressor(random_state=42)
)
In [88]:
# Make the grid of parameters to search
# NOTE: you must prepend the name of the pipeline step
model_name = "randomforestregressor"
param_grid = {
    f"{model_name}__n_estimators": [5, 50, 100, 150, 200],
    f"{model_name}__max_depth": [2, 5, 7, 9, 13]
}

param_grid
Out[88]:
{'randomforestregressor__n_estimators': [5, 50, 100, 150, 200],
 'randomforestregressor__max_depth': [2, 5, 7, 9, 13]}
In [89]:
# Create the grid and use 5-fold CV
grid = GridSearchCV(pipe, param_grid, verbose=5, cv=3, error_score='raise')

# Run the search
grid.fit(train_set, y_train)
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.344 total time=   0.0s
[CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.125 total time=   0.0s
[CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=5;, score=0.236 total time=   0.0s
[CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.343 total time=   0.2s
[CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.126 total time=   0.1s
[CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=50;, score=0.331 total time=   0.1s
[CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.339 total time=   0.2s
[CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.125 total time=   0.2s
[CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=100;, score=0.339 total time=   0.3s
[CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.336 total time=   0.4s
[CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.123 total time=   0.3s
[CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=150;, score=0.350 total time=   0.3s
[CV 1/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.337 total time=   0.5s
[CV 2/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.122 total time=   0.5s
[CV 3/3] END randomforestregressor__max_depth=2, randomforestregressor__n_estimators=200;, score=0.345 total time=   0.4s
[CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=0.429 total time=   0.0s
[CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=0.085 total time=   0.0s
[CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=5;, score=-0.298 total time=   0.0s
[CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.377 total time=   0.3s
[CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.094 total time=   0.3s
[CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=50;, score=0.145 total time=   0.3s
[CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.383 total time=   0.6s
[CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.087 total time=   0.9s
[CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=100;, score=0.175 total time=   0.6s
[CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.373 total time=   0.8s
[CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.076 total time=   0.9s
[CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=150;, score=0.165 total time=   0.8s
[CV 1/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.374 total time=   1.2s
[CV 2/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.070 total time=   1.4s
[CV 3/3] END randomforestregressor__max_depth=5, randomforestregressor__n_estimators=200;, score=0.164 total time=   1.2s
[CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=0.296 total time=   0.1s
[CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=0.130 total time=   0.1s
[CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=5;, score=-0.185 total time=   0.1s
[CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.298 total time=   0.5s
[CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.104 total time=   0.5s
[CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=50;, score=0.173 total time=   0.5s
[CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.286 total time=   0.9s
[CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.106 total time=   0.9s
[CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=100;, score=0.211 total time=   0.9s
[CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.278 total time=   1.4s
[CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.096 total time=   1.4s
[CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=150;, score=0.216 total time=   1.6s
[CV 1/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.277 total time=   1.9s
[CV 2/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.085 total time=   1.8s
[CV 3/3] END randomforestregressor__max_depth=7, randomforestregressor__n_estimators=200;, score=0.216 total time=   1.8s
[CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=0.352 total time=   0.1s
[CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=0.036 total time=   0.1s
[CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=5;, score=-0.066 total time=   0.1s
[CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.309 total time=   0.9s
[CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.089 total time=   0.6s
[CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=50;, score=0.204 total time=   0.6s
[CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.291 total time=   1.2s
[CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.091 total time=   1.3s
[CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=100;, score=0.251 total time=   1.2s
[CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.276 total time=   1.8s
[CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.083 total time=   2.0s
[CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=150;, score=0.247 total time=   2.1s
[CV 1/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.278 total time=   2.5s
[CV 2/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.078 total time=   2.6s
[CV 3/3] END randomforestregressor__max_depth=9, randomforestregressor__n_estimators=200;, score=0.238 total time=   2.6s
[CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=0.309 total time=   0.1s
[CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=0.144 total time=   0.1s
[CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=5;, score=-0.195 total time=   0.1s
[CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.299 total time=   0.7s
[CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.103 total time=   0.8s
[CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=50;, score=0.205 total time=   0.7s
[CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.292 total time=   1.6s
[CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.089 total time=   1.6s
[CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=100;, score=0.245 total time=   1.4s
[CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.282 total time=   2.1s
[CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.080 total time=   2.2s
[CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=150;, score=0.246 total time=   2.4s
[CV 1/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.281 total time=   2.7s
[CV 2/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.073 total time=   3.0s
[CV 3/3] END randomforestregressor__max_depth=13, randomforestregressor__n_estimators=200;, score=0.240 total time=   3.2s
Out[89]:
GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['percent_of_housing_crowded',
                                                                          'percent_households_below_poverty',
                                                                          'percent_aged_16_unemployed',
                                                                          'percent_aged_25_without_high_school_diploma',
                                                                          'percent_aged_under_18_or_over_64',
                                                                          'per_capita_income_',
                                                                          'hardship_ind...
                                                                          'logDistGrocery',
                                                                          'logDistSubway',
                                                                          'logDistRetail',
                                                                          'logDistLandmark',
                                                                          'logDistAlerts']),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['dow',
                                                                          'community'])])),
                                       ('randomforestregressor',
                                        RandomForestRegressor(random_state=42))]),
             param_grid={'randomforestregressor__max_depth': [2, 5, 7, 9, 13],
                         'randomforestregressor__n_estimators': [5, 50, 100,
                                                                 150, 200]},
             verbose=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['percent_of_housing_crowded',
                                                                          'percent_households_below_poverty',
                                                                          'percent_aged_16_unemployed',
                                                                          'percent_aged_25_without_high_school_diploma',
                                                                          'percent_aged_under_18_or_over_64',
                                                                          'per_capita_income_',
                                                                          'hardship_ind...
                                                                          'logDistGrocery',
                                                                          'logDistSubway',
                                                                          'logDistRetail',
                                                                          'logDistLandmark',
                                                                          'logDistAlerts']),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['dow',
                                                                          'community'])])),
                                       ('randomforestregressor',
                                        RandomForestRegressor(random_state=42))]),
             param_grid={'randomforestregressor__max_depth': [2, 5, 7, 9, 13],
                         'randomforestregressor__n_estimators': [5, 50, 100,
                                                                 150, 200]},
             verbose=5)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['percent_of_housing_crowded',
                                                   'percent_households_below_poverty',
                                                   'percent_aged_16_unemployed',
                                                   'percent_aged_25_without_high_school_diploma',
                                                   'percent_aged_under_18_or_over_64',
                                                   'per_capita_income_',
                                                   'hardship_index',
                                                   'logDistDeposit',
                                                   'logDistAbbuilding',
                                                   'logDistSchool',
                                                   'logDistGrocery',
                                                   'logDistSubway',
                                                   'logDistRetail',
                                                   'logDistLandmark',
                                                   'logDistAlerts']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['dow', 'community'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])
ColumnTransformer(transformers=[('num', StandardScaler(),
                                 ['percent_of_housing_crowded',
                                  'percent_households_below_poverty',
                                  'percent_aged_16_unemployed',
                                  'percent_aged_25_without_high_school_diploma',
                                  'percent_aged_under_18_or_over_64',
                                  'per_capita_income_', 'hardship_index',
                                  'logDistDeposit', 'logDistAbbuilding',
                                  'logDistSchool', 'logDistGrocery',
                                  'logDistSubway', 'logDistRetail',
                                  'logDistLandmark', 'logDistAlerts']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['dow', 'community'])])
['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(random_state=42)
In [90]:
# Evaluate the best random forest model
best_random = grid.best_estimator_
best_random
Out[90]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['percent_of_housing_crowded',
                                                   'percent_households_below_poverty',
                                                   'percent_aged_16_unemployed',
                                                   'percent_aged_25_without_high_school_diploma',
                                                   'percent_aged_under_18_or_over_64',
                                                   'per_capita_income_',
                                                   'hardship_index',
                                                   'logDistDeposit',
                                                   'logDistAbbuilding',
                                                   'logDistSchool',
                                                   'logDistGrocery',
                                                   'logDistSubway',
                                                   'logDistRetail',
                                                   'logDistLandmark',
                                                   'logDistAlerts']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['dow', 'community'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=2, n_estimators=150,
                                       random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['percent_of_housing_crowded',
                                                   'percent_households_below_poverty',
                                                   'percent_aged_16_unemployed',
                                                   'percent_aged_25_without_high_school_diploma',
                                                   'percent_aged_under_18_or_over_64',
                                                   'per_capita_income_',
                                                   'hardship_index',
                                                   'logDistDeposit',
                                                   'logDistAbbuilding',
                                                   'logDistSchool',
                                                   'logDistGrocery',
                                                   'logDistSubway',
                                                   'logDistRetail',
                                                   'logDistLandmark',
                                                   'logDistAlerts']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['dow', 'community'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=2, n_estimators=150,
                                       random_state=42))])
ColumnTransformer(transformers=[('num', StandardScaler(),
                                 ['percent_of_housing_crowded',
                                  'percent_households_below_poverty',
                                  'percent_aged_16_unemployed',
                                  'percent_aged_25_without_high_school_diploma',
                                  'percent_aged_under_18_or_over_64',
                                  'per_capita_income_', 'hardship_index',
                                  'logDistDeposit', 'logDistAbbuilding',
                                  'logDistSchool', 'logDistGrocery',
                                  'logDistSubway', 'logDistRetail',
                                  'logDistLandmark', 'logDistAlerts']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['dow', 'community'])])
['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)
In [91]:
grid.best_estimator_["randomforestregressor"]
Out[91]:
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(max_depth=2, n_estimators=150, random_state=42)
In [92]:
grid.best_params_
Out[92]:
{'randomforestregressor__max_depth': 2,
 'randomforestregressor__n_estimators': 150}
In [93]:
grid.score(test_set, y_test)
Out[93]:
0.5438955252961919
In [94]:
pipe.fit(train_set, y_train)
Out[94]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['percent_of_housing_crowded',
                                                   'percent_households_below_poverty',
                                                   'percent_aged_16_unemployed',
                                                   'percent_aged_25_without_high_school_diploma',
                                                   'percent_aged_under_18_or_over_64',
                                                   'per_capita_income_',
                                                   'hardship_index',
                                                   'logDistDeposit',
                                                   'logDistAbbuilding',
                                                   'logDistSchool',
                                                   'logDistGrocery',
                                                   'logDistSubway',
                                                   'logDistRetail',
                                                   'logDistLandmark',
                                                   'logDistAlerts']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['dow', 'community'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['percent_of_housing_crowded',
                                                   'percent_households_below_poverty',
                                                   'percent_aged_16_unemployed',
                                                   'percent_aged_25_without_high_school_diploma',
                                                   'percent_aged_under_18_or_over_64',
                                                   'per_capita_income_',
                                                   'hardship_index',
                                                   'logDistDeposit',
                                                   'logDistAbbuilding',
                                                   'logDistSchool',
                                                   'logDistGrocery',
                                                   'logDistSubway',
                                                   'logDistRetail',
                                                   'logDistLandmark',
                                                   'logDistAlerts']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['dow', 'community'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])
ColumnTransformer(transformers=[('num', StandardScaler(),
                                 ['percent_of_housing_crowded',
                                  'percent_households_below_poverty',
                                  'percent_aged_16_unemployed',
                                  'percent_aged_25_without_high_school_diploma',
                                  'percent_aged_under_18_or_over_64',
                                  'per_capita_income_', 'hardship_index',
                                  'logDistDeposit', 'logDistAbbuilding',
                                  'logDistSchool', 'logDistGrocery',
                                  'logDistSubway', 'logDistRetail',
                                  'logDistLandmark', 'logDistAlerts']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['dow', 'community'])])
['percent_of_housing_crowded', 'percent_households_below_poverty', 'percent_aged_16_unemployed', 'percent_aged_25_without_high_school_diploma', 'percent_aged_under_18_or_over_64', 'per_capita_income_', 'hardship_index', 'logDistDeposit', 'logDistAbbuilding', 'logDistSchool', 'logDistGrocery', 'logDistSubway', 'logDistRetail', 'logDistLandmark', 'logDistAlerts']
StandardScaler()
['dow', 'community']
OneHotEncoder(handle_unknown='ignore')
RandomForestRegressor(random_state=42)
In [95]:
# Get the R2 on the test set!
best_random.score(test_set, y_test) 
Out[95]:
0.5438955252961919
In [96]:
ohe = transformer.named_transformers_['cat']
ohe
Out[96]:
OneHotEncoder(handle_unknown='ignore')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
OneHotEncoder(handle_unknown='ignore')
In [97]:
# One column for each category type!
ohe_cols = ohe.get_feature_names_out()
ohe_cols
Out[97]:
array(['dow_Friday', 'dow_Monday', 'dow_Saturday', 'dow_Sunday',
       'dow_Thursday', 'dow_Tuesday', 'dow_Wednesday',
       'community_ALBANY PARK', 'community_ARCHER HEIGHTS',
       'community_ARMOUR SQUARE', 'community_ASHBURN',
       'community_AUBURN GRESHAM', 'community_AUSTIN',
       'community_AVALON PARK', 'community_AVONDALE',
       'community_BELMONT CRAGIN', 'community_BEVERLY',
       'community_BRIDGEPORT', 'community_BRIGHTON PARK',
       'community_BURNSIDE', 'community_CALUMET HEIGHTS',
       'community_CHATHAM', 'community_CHICAGO LAWN',
       'community_CLEARING', 'community_DOUGLAS', 'community_DUNNING',
       'community_EAST GARFIELD PARK', 'community_EAST SIDE',
       'community_EDGEWATER', 'community_EDISON PARK',
       'community_ENGLEWOOD', 'community_FOREST GLEN',
       'community_FULLER PARK', 'community_GAGE PARK',
       'community_GARFIELD RIDGE', 'community_GRAND BOULEVARD',
       'community_GREATER GRAND CROSSING', 'community_HEGEWISCH',
       'community_HERMOSA', 'community_HUMBOLDT PARK',
       'community_HYDE PARK', 'community_IRVING PARK',
       'community_JEFFERSON PARK', 'community_KENWOOD',
       'community_LAKE VIEW', 'community_LINCOLN PARK',
       'community_LINCOLN SQUARE', 'community_LOGAN SQUARE',
       'community_LOOP', 'community_LOWER WEST SIDE',
       'community_MCKINLEY PARK', 'community_MONTCLARE',
       'community_MORGAN PARK', 'community_MOUNT GREENWOOD',
       'community_NEAR NORTH SIDE', 'community_NEAR SOUTH SIDE',
       'community_NEAR WEST SIDE', 'community_NEW CITY',
       'community_NORTH CENTER', 'community_NORTH LAWNDALE',
       'community_NORTH PARK', 'community_NORWOOD PARK',
       'community_OAKLAND', 'community_OHARE', 'community_PORTAGE PARK',
       'community_PULLMAN', 'community_RIVERDALE',
       'community_ROGERS PARK', 'community_ROSELAND',
       'community_SOUTH CHICAGO', 'community_SOUTH DEERING',
       'community_SOUTH LAWNDALE', 'community_SOUTH SHORE',
       'community_UPTOWN', 'community_WASHINGTON HEIGHTS',
       'community_WASHINGTON PARK', 'community_WEST ELSDON',
       'community_WEST ENGLEWOOD', 'community_WEST GARFIELD PARK',
       'community_WEST LAWN', 'community_WEST PULLMAN',
       'community_WEST RIDGE', 'community_WEST TOWN',
       'community_WOODLAWN'], dtype=object)
In [98]:
features = num_cols + list(ohe_cols)
random_forest = pipe["randomforestregressor"]

# Create the dataframe with importances
importance = pd.DataFrame(
    {"Feature": features, "Importance": random_forest.feature_importances_}
)
In [99]:
# Sort by importance and get the top 20, and sort in a descending order
importance = importance.sort_values("Importance", ascending=False).iloc[:20]

# Plot
importance.hvplot.barh(x="Feature", y="Importance", color="purple", alpha=0.5, edgecolor="purple", height=500, flip_yaxis=True)
WARNING:param.main: edgecolor option not found for barh plot with bokeh; similar options include: ['color', 'color', 'muted_color']
Out[99]:

Part 5 - Evaluation¶

In [100]:
# Make the predictions
predictions = best_random.predict(test_set)

# MAE
# NOTE: we convert from log(sale_price) to sale_price using exp()
errors = (np.exp(predictions) - np.exp(y_test)) / np.exp(y_test) * 100
In [101]:
#extract out the test set data
crime_test = crime.loc[test_set.index]

#include the percent error for all of the sales in the test set
crime_test['percent_error'] =  errors
crime_test['prediction'] = np.exp(predictions)
crime_test['abs_percent_error']=abs(crime_test['percent_error'])
crime_test
Out[101]:
community month_count month dow week_count weapon_count count crime_index weapon_rate people_risk_index domestic arrest percent_of_housing_crowded percent_households_below_poverty percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 per_capita_income_ hardship_index logDistDeposit logDistAbbuilding logDistSchool logDistGrocery logDistSubway logDistRetail logDistAlerts logDistLandmark percent_error prediction abs_percent_error
8 ROGERS PARK 1649 10 Monday 2744 1309 18144 662.189781 0.072694 18.343207 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921 -34.095150 436.415181 34.095150
12 ROGERS PARK 1649 10 Tuesday 2547 1309 18144 662.189781 0.292057 16.979225 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921 -34.095150 436.415181 34.095150
14 ROGERS PARK 1384 11 Friday 2583 1309 18144 662.189781 0.082162 19.476170 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921 -34.095150 436.415181 34.095150
15 ROGERS PARK 1384 11 Monday 2744 1309 18144 662.189781 0.092411 20.955373 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921 -34.095150 436.415181 34.095150
17 ROGERS PARK 1384 11 Sunday 2549 1309 18144 662.189781 0.450138 17.155176 0.140267 0.171627 7.7 23.6 8.7 18.2 27.5 23939 39.0 3.188012 3.891309 2.598127 2.655374 3.154793 2.589933 4.108093 3.099921 -34.095150 436.415181 34.095150
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6460 EDGEWATER 1183 8 Wednesday 1786 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576 -2.512832 436.415181 2.512832
6461 EDGEWATER 1081 9 Friday 1908 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576 -2.512832 436.415181 2.512832
6462 EDGEWATER 1081 9 Monday 1748 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576 -2.512832 436.415181 2.512832
6464 EDGEWATER 1081 9 Sunday 1591 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576 -2.512832 436.415181 2.512832
6465 EDGEWATER 1081 9 Thursday 1738 682 12266 447.664234 0.000000 0.000000 0.104598 0.159873 4.1 18.2 9.2 9.7 23.8 33385 19.0 2.708248 3.773897 2.771839 2.893563 3.087525 2.728233 4.023814 2.520576 -2.512832 436.415181 2.512832

1941 rows × 30 columns

In [102]:
error_test = crime_test.groupby("community")["abs_percent_error"].mean().reset_index()
error_test = boundary.merge(error_test,on="community")
In [103]:
error_test = error_test.rename(columns={'area_numbe': 'community_area'}).drop(columns=['area', 'shape_area','perimeter','area_num_1','comarea_id','comarea','shape_len'])
error_test
Out[103]:
community community_area geometry abs_percent_error
0 DOUGLAS 35 MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ... 46.250908
1 OAKLAND 36 MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ... 24.615683
2 FULLER PARK 37 MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ... 29.125332
3 GRAND BOULEVARD 38 MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ... 19.231095
4 KENWOOD 39 MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ... 70.387232
... ... ... ... ...
72 MOUNT GREENWOOD 74 MULTIPOLYGON (((-87.69646 41.70714, -87.69644 ... 55.974961
73 MORGAN PARK 75 MULTIPOLYGON (((-87.64215 41.68508, -87.64249 ... 19.973673
74 OHARE 76 MULTIPOLYGON (((-87.83658 41.98640, -87.83658 ... 47.622471
75 EDGEWATER 77 MULTIPOLYGON (((-87.65456 41.99817, -87.65456 ... 2.512832
76 EDISON PARK 9 MULTIPOLYGON (((-87.80676 42.00084, -87.80676 ... 221.550299

77 rows × 4 columns

In [104]:
error_test_hex = h3fy(error_test, resolution=8)
/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/util/util.py:141: FutureWarning:

The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.

In [105]:
# Plot all features in the same figure
fig, ax = plt.subplots(ncols=1,nrows=1, figsize=(7,7))

# MAE
ax=ax
MAE_hex = area_interpolate(source_df=error_test, target_df=error_test_hex, intensive_variables=['abs_percent_error'])
MAE_hex = MAE_hex.to_crs(3857)
MAE_hex.plot('abs_percent_error', cmap='magma_r', alpha=0.6, linewidth=0.6, edgecolor="white", ax=ax, legend=True, legend_kwds=dict(loc='upper right'),scheme="Quantiles", k=10)

ctx.add_basemap(ax=ax, zoom='auto', source=ctx.providers.Stamen.TonerBackground)
ax.axis('off')
ax.set_title('Distribution of MAE',fontsize=15)

plt.savefig('hexgrids_MAE')
/var/folders/7b/8810qk1521x704xftzhzw8_40000gn/T/ipykernel_96566/4013683622.py:6: UserWarning:

Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.


/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/tobler/area_weighted/area_interpolate.py:308: UserWarning:

Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.


/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:740: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/geopandas/plotting.py:74: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.

/Users/paprika/opt/anaconda3/envs/musa-550-fall-2022/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning:

distutils Version classes are deprecated. Use packaging.version instead.